Introduction
The dynamics of flight pricing interest and affect travelers around
the globe. According to ICAO statistics, the total number of passengers
carried on scheduled services rose to 4.5 billion in 2019, a 3.6 per
cent increase from the previous year, while the number of departures
reached 38.3 million, a 1.7 per cent increase [3]. Air travel is
expected to keep growing, which is why it is important for travelers to
understand the factors behind flight prices. Understanding these factors
could save them money as they go off to see their families or take a
well-deserved break. In this report, we tackle the questions below:
1. Does departure time affect the price of the air ticket? Which
departure time has the cheapest and the most expensive flight tickets?
2. Does the duration of the flight affect the price of the air
ticket?
3. Is the price of a flight affected by the number of days left
before departure?
Data
The data we used to investigate our questions comes from
Clean_Dataset.csv in the Flight Price Prediction dataset on Kaggle,
available at https://www.kaggle.com/datasets/shubhambathwal/flight-price-prediction/data.
The dataset was prepared and compiled by Shubham Bathwal from the
“Ease My Trip” website. It covers both economy-class and business-class
tickets for flights between India’s top 6 metro cities, collected over
a period of 50 days.
The dataset contains 300,153 entries, each representing a flight
ticket listed on the “Ease My Trip” website. The raw file has 12
columns: a row index plus 11 columns of flight information, namely
airline, flight code, source city, departure time, number of stops,
arrival time, destination city, ticket class, flight duration, days left
until the flight, and ticket price. The dataset covers 6 unique
airlines, 1,561 unique flight codes, 6 unique source cities, 6 unique
departure time slots, 6 unique arrival time slots, 6 unique destination
cities, and 2 ticket classes.
Data Cleaning and Pre-processing
The first column of the dataset holds the row number of each entry.
As this information is not needed, the column is removed. Several
columns (airline, flight, source city, departure time, stops, arrival
time, destination city, and class) are of character type. These columns
are converted to factors for easier manipulation and analysis. The
cleaned dataset contains 11 columns: 8 of type factor and 3 of type
double.
# Loading the data
data <- read_csv("archive/Clean_Dataset.csv")
New names:
• `` -> `...1`
Rows: 300153 Columns: 12
── Column specification ───────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (8): airline, flight, source_city, departure_time, stops, arrival_time, destination_city, c...
dbl (4): ...1, duration, days_left, price
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Remove the row-number column and convert character columns to factors
df <- data %>% select(-1) %>%
mutate(across(c(airline, flight, source_city, departure_time, stops, arrival_time, destination_city, class), as.factor))
# Structure of dataset
str(df)
tibble [300,153 × 11] (S3: tbl_df/tbl/data.frame)
$ airline : Factor w/ 6 levels "Air_India","AirAsia",..: 5 5 2 6 6 6 6 6 3 3 ...
$ flight : Factor w/ 1561 levels "6E-102","6E-105",..: 1409 1388 1214 1560 1550 1542 1534 1544 1014 1015 ...
$ source_city : Factor w/ 6 levels "Bangalore","Chennai",..: 3 3 3 3 3 3 3 3 3 3 ...
$ departure_time : Factor w/ 6 levels "Afternoon","Early_Morning",..: 3 2 2 5 5 5 5 1 2 1 ...
$ stops : Factor w/ 3 levels "one","two_or_more",..: 3 3 3 3 3 3 3 3 3 3 ...
$ arrival_time : Factor w/ 6 levels "Afternoon","Early_Morning",..: 6 5 2 1 5 1 5 3 5 3 ...
$ destination_city: Factor w/ 6 levels "Bangalore","Chennai",..: 6 6 6 6 6 6 6 6 6 6 ...
$ class : Factor w/ 2 levels "Business","Economy": 2 2 2 2 2 2 2 2 2 2 ...
$ duration : num [1:300153] 2.17 2.33 2.17 2.25 2.33 2.33 2.08 2.17 2.17 2.25 ...
$ days_left : num [1:300153] 1 1 1 1 1 1 1 1 1 1 ...
$ price : num [1:300153] 5953 5953 5956 5955 5955 ...
# Print the dataset
df
Exploratory Data Analysis
# Summary of the dataset
summary(df)
airline flight source_city departure_time stops
Air_India: 80892 UK-706 : 3235 Bangalore:52061 Afternoon :47794 one :250863
AirAsia : 16098 UK-772 : 2741 Chennai :38700 Early_Morning:66790 two_or_more: 13286
GO_FIRST : 23173 UK-720 : 2650 Delhi :61343 Evening :65102 zero : 36004
Indigo : 43120 UK-836 : 2542 Hyderabad:40806 Late_Night : 1306
SpiceJet : 9011 UK-822 : 2468 Kolkata :46347 Morning :71146
Vistara :127859 UK-828 : 2440 Mumbai :60896 Night :48015
(Other):284077
arrival_time destination_city class duration days_left
Afternoon :38139 Bangalore:51068 Business: 93487 Min. : 0.83 Min. : 1
Early_Morning:15417 Chennai :40368 Economy :206666 1st Qu.: 6.83 1st Qu.:15
Evening :78323 Delhi :57360 Median :11.25 Median :26
Late_Night :14001 Hyderabad:42726 Mean :12.22 Mean :26
Morning :62735 Kolkata :49534 3rd Qu.:16.17 3rd Qu.:38
Night :91538 Mumbai :59097 Max. :49.83 Max. :49
price
Min. : 1105
1st Qu.: 4783
Median : 7425
Mean : 20890
3rd Qu.: 42521
Max. :123071
Visualizations & Interpretation
Question 1:
The boxplot shows that the median price is relatively similar across
most departure times, ranging between 6,500 and 8,200 Rupees. Late-night
departures have a lower median than the other departure times, at 4,499
Rupees. However, it is difficult to determine which departure time has
the highest median based solely on the boxplot. The boxplot also shows
that late-night departures have the smallest interquartile range in
price, while night departures have the largest, indicating greater
price variability.
# Boxplot for flight prices by departure time
ggplot(df, aes(x = departure_time, y = price)) +
geom_boxplot() +
labs(title = "Flight Prices by Departure Time", x = "Departure Time", y = "Price") +
theme_minimal()

Comparing mean prices gives the same picture: late-night departures
remain the cheapest, with a mean ticket price of 9,295.30 Rupees, while
night departures are the most expensive, with a mean of 23,062.15
Rupees.
# Calculate the mean price by departure time
df_departure <- df %>%
group_by(departure_time) %>%
summarize(mean_price = mean(price))
# Display the mean flight ticket price in a table
kable(df_departure, caption = "Mean Price by Departure Time")
Mean Price by Departure Time
| departure_time | mean_price |
|----------------|-----------:|
| Afternoon      |  18179.203 |
| Early_Morning  |  20370.677 |
| Evening        |  21232.362 |
| Late_Night     |   9295.299 |
| Morning        |  21630.760 |
| Night          |  23062.147 |
The histogram of night departure prices is right-skewed: most tickets
fall at the lower end of the price range, while a long tail extends to
higher prices. The histogram also has a second, smaller peak around
30,000 to 70,000 Rupees, so tickets with night departure times typically
fall within the price range of one of the two peaks.
# Get the entries for flights with Night departure times
df_night <- df %>% filter(departure_time == "Night")
# Plot histogram to investigate the distribution of the Night departure times flight price
ggplot(df_night, aes(x = price)) +
geom_histogram(bins = 30, fill = "lightblue", color = "black") +
labs(title = "Histogram of Night Flight Prices", x = "Price", y = "Frequency") +
theme_minimal()

The histogram of late-night ticket prices shows that the data are
heavily concentrated at the lower end. There are only a few isolated
outliers where prices exceed 20,000 Rupees. The histogram also shows
that there are far fewer late-night tickets than night tickets (1,306
versus 48,015 entries in the summary above).
# Get the entries for flights with Late Night departure times
df_late <- df %>% filter(departure_time == "Late_Night")
# Plot histogram to investigate the distribution of the Late Night departure times flight price
ggplot(df_late, aes(x = price)) +
geom_histogram(bins =30, fill = "lightblue", color = "black") +
labs(title = "Histogram of Late Night Flight Prices", x = "Price", y = "Frequency") +
theme_minimal()

Question 2:
The scatterplot reveals that there are two different clusters present
in the graph. Many of the points are concentrated on the lower left of
the graph, which indicates that many of the flight tickets are lower in
price and the flights have shorter duration. The positively sloped
trendline also demonstrates that the duration of the flights and the
price of flight tickets are positively correlated, where the longer the
flight duration, the higher the price of the flight tickets.
# Plot scatter and density plot to investigate the relationship between duration and the prices of the flight
ggplot(df, aes(x=duration, y=price)) +
geom_pointdensity(size = 0.5, alpha=0.05) +
scale_color_viridis_c() +
geom_smooth(method = "lm", formula = y ~ x, color = "red") +
labs(title = "Duration vs Price")

To further examine the two clusters in the scatterplot, we coloured
the points by ticket class. With the colours applied, the scatterplot
shows that the two clusters correspond to the two ticket classes: the
pink cluster contains business-class tickets and the blue cluster
contains economy-class tickets. Business-class tickets are generally
more expensive than economy-class tickets, so the business-class
cluster sits higher in the graph. The two ticket classes also explain
the wide range of prices for the same flight duration.
# Scatterplot to show the class of the flight tickets
ggplot(df, aes(x = duration, y = price, color = class)) +
geom_point(size = 0.2, alpha = 0.3) +
geom_smooth(method = "lm", formula = y ~ x, color = "red") +
labs(title = "Duration vs Price", x = "Duration", y = "Price", color = "Class") +
theme_minimal()

The scatter-density plot for business-class flight tickets reveals
two main clusters of points. These clusters are centered around 25,000
and 60,000 Rupees. Both clusters fall within the duration range of 0 to
20 hours. The trend line in the graph highlights a positive relationship
between flight duration and ticket price, where longer flights generally
have higher prices for business-class tickets.
# Get the entries for business-class flight tickets
df_business <- df %>% filter(class == "Business")
# Plot a point-density scatterplot to investigate the relationship between duration and the prices of Business class flights
ggplot(df_business, aes(x = duration, y = price)) +
geom_pointdensity(size = 0.5, alpha = 0.5) +
scale_color_viridis_c() +
geom_smooth(method = "lm", formula = y ~ x, color = "red") +
labs(title = "Business Class's Duration vs Price", x = "Duration", y = "Price") +
theme_minimal()

As in the scatter-density plot for business-class tickets, the plot
for economy-class tickets has two main clusters. One cluster is located
around the price level of 2,500 Rupees, and the second cluster is around
50,000 Rupees. Most flights have durations shorter than 20 hours, as
indicated by the higher density of points within this range. The trend
line, as in the previous graph, shows a positive relationship between
flight duration and ticket price.
# Get the entries for economy-class flight tickets
df_economy <- df %>% filter(class == "Economy")
# Plot a point-density scatterplot to investigate the relationship between duration and the prices of Economy class flights
ggplot(df_economy, aes(x = duration, y = price)) +
geom_pointdensity(size = 0.5, alpha = 0.5) +
scale_color_viridis_c() +
geom_smooth(method = "lm", formula = y ~ x, color = "red") +
labs(title = "Economy Class's Duration vs Price", x = "Duration", y = "Price") +
theme_minimal()

Question 3:
Below, we created scatter plots to see if there is a correlation
between the number of days left before departure and the price of the
flight ticket. Furthermore, we produced one plot per source and
destination city pair, since we did not have individual flight numbers
to track.
library(ggplot2)
# Adding a new column "source_dest" which is a combination of
# source and destination city
combined_cities <- df %>% unite(source_dest, source_city, destination_city, sep = "_", remove = FALSE)
# Define the plotting function
plot_for_category <- function(cat, data) {
  subset_data <- subset(data, source_dest == cat)
  ggplot(subset_data, aes(x = days_left, y = price)) +
    geom_point(color = "blue") +
    geom_smooth(method = "lm", color = "red") +
    labs(
      title = paste("Plot for Category:", cat),
      x = "Days before Departure",
      y = "Price (Rupees)"
    ) +
    theme_minimal()
}
# Get unique categories
unique_categories <- unique(combined_cities$source_dest)
# Generate and store plots for each category
plots <- lapply(unique_categories, function(cat) plot_for_category(cat, combined_cities))
# Display all plots; the y-axis limit must be added inside print(),
# otherwise the modified plot is discarded
for (p in plots) {
  print(p + ylim(0, 100000))
}
`geom_smooth()` using formula = 'y ~ x'

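The per-route relationship between days left and price can also be summarised numerically rather than read off 30 separate plots. The following is a minimal sketch on synthetic data, not the flight dataset: the `toy` data frame, the three route names, and the downward price trend are assumptions made for illustration, so the output shows only how the calculation works.

```r
set.seed(42)

# Synthetic stand-in for combined_cities (hypothetical routes and prices)
routes <- c("Delhi_Mumbai", "Mumbai_Delhi", "Delhi_Bangalore")
toy <- data.frame(
  source_dest = rep(routes, each = 49),
  days_left   = rep(1:49, times = length(routes))
)
# Assumed pattern: prices fall as the departure date gets further away
toy$price <- 20000 - 200 * toy$days_left + rnorm(nrow(toy), sd = 1500)

# Pearson correlation between days_left and price, one value per route
route_cor <- sapply(
  split(toy, toy$source_dest),
  function(d) cor(d$days_left, d$price)
)
print(round(route_cor, 3))
```

Applied to `combined_cities`, the same `split()` and `sapply()` pattern would give one correlation per city pair, which makes it easy to check whether every route has a negative relationship.
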
Machine Learning Models
First, we split the dataset into training and testing sets to build
our machine learning models. The training set holds 70 percent of the
dataset and the testing set holds the remaining 30 percent. We also drew
a random sample of 10,000 rows from the training set, because fitting
some of the models on the full training set took too long on our
computers.
set.seed(123)
smp_size <- floor(0.7 * nrow(df))
row_index <- sample(1: nrow(df), size = smp_size)
train_smp <- df[row_index, ]
test_smp <- df[-row_index, ]
sample_data <- train_smp[sample(nrow(train_smp), 10000), ]
Tree-Based Models
The first model we use to predict flight prices is a regression
tree. The model uses one categorical variable (departure time of the
flight) and two continuous variables (duration of the flight and days
left until the flight). However, this model predicts prices poorly: the
mean absolute error is high and the correlation between predicted and
actual prices is weak.
# Tree-Based Models
ar <- rpart(price ~ departure_time + duration + days_left,
train_smp)
preds <- predict(ar, test_smp)
# Calculate Mean Absolute Error (MAE)
mae <- mean(abs(preds - test_smp$price))
mae
[1] 18879.4
# Calculate correlation coefficient
cr <- cor(preds,test_smp$price)
cr
[1] 0.2592221
# Get the optimal cp value
optimal_cp <- ar$cptable[which.min(ar$cptable[,"xerror"]), "CP"]
# Prune the tree
pruned_tree <- prune(ar, cp = optimal_cp)
# Visualize the tree
rpart.plot(pruned_tree, box.palette = "Blues", main = "Simplified Decision Tree")

To make the model more accurate, we trained the tree-based model with
more independent variables. The retrained model produces predictions
with a much stronger correlation and a lower mean absolute error than
the first model.
# Train model with more variables
ar2 <- rpart(price ~ departure_time + duration + days_left + stops + source_city + airline + class + destination_city,
train_smp)
preds <- predict(ar2, test_smp)
# Calculate Mean Absolute Error (MAE)
mae <- mean(abs(preds - test_smp$price))
mae
[1] 4315.176
# Calculate correlation coefficient
cr <- cor(preds,test_smp$price)
cr
[1] 0.9569965
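Both tree models are scored with the same two metrics, the mean absolute error (MAE) and the Pearson correlation between predicted and actual prices. As a minimal sketch, the two calculations could be wrapped in one helper; the function name `evaluate_predictions` and the four example prices are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper: returns MAE and Pearson correlation for a set of predictions
evaluate_predictions <- function(predicted, actual) {
  stopifnot(length(predicted) == length(actual))
  c(
    mae         = mean(abs(predicted - actual)),
    correlation = cor(predicted, actual)
  )
}

# Made-up prices for illustration only
actual    <- c(5000, 7000, 42000, 60000)
predicted <- c(6000, 6500, 40000, 63000)

metrics <- evaluate_predictions(predicted, actual)
print(metrics)
```

A call such as `evaluate_predictions(predict(ar2, test_smp), test_smp$price)` would then replace the repeated `mean(abs(...))` and `cor(...)` lines.
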
The plot shows that the flight class is the main predictor of the
flight ticket’s price. The duration of the flight further influences the
price of the business class flight.
# Get the optimal cp value
optimal_cp2 <- ar2$cptable[which.min(ar2$cptable[,"xerror"]), "CP"]
# Prune the tree
pruned_tree2 <- prune(ar2, cp = optimal_cp2)
# Visualize the tree
rpart.plot(pruned_tree2, box.palette = "Blues", main = "Simplified Decision Tree")

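The pruning step above picks the complexity parameter (cp) with the lowest cross-validated error from the tree's cp table. The sketch below repeats that selection on R's built-in `mtcars` data so that it runs without the flight dataset; the choice of `mtcars`, the lowered `minsplit`, and the small starting `cp` are assumptions made only to grow a tree large enough to prune.

```r
library(rpart)

set.seed(123)  # cross-validation folds are random, so fix the seed

# Grow a deliberately large tree (assumed settings, for illustration)
full_tree <- rpart(
  mpg ~ ., data = mtcars,
  control = rpart.control(minsplit = 5, cp = 0.001)
)

# Each row of the cp table is a candidate subtree; xerror is its
# cross-validated relative error
cp_table   <- full_tree$cptable
best_row   <- which.min(cp_table[, "xerror"])
optimal_cp <- cp_table[best_row, "CP"]

# Prune back to the subtree with the lowest cross-validated error
pruned <- prune(full_tree, cp = optimal_cp)

cat("Nodes before pruning:", nrow(full_tree$frame), "\n")
cat("Nodes after pruning: ", nrow(pruned$frame), "\n")
```

If the lowest `xerror` happens to sit in the last row of the table, `prune()` returns the tree unchanged, which can occur when a tree is grown with the default `cp`.
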
Linear Models
The next model we use to predict flight prices is a linear model,
fitted on the 10,000-row sample with every remaining column except the
flight code as a predictor.
# Remove Flight column from data frame
sample_data <- subset(sample_data, select = -c(flight))
# Build the formula for the linear model: every column except price is a predictor
predictor_vars <- setdiff(names(sample_data), "price")
formula <- as.formula(paste("price ~", paste(predictor_vars, collapse = " + ")))
# Build the linear model
linear_model <- lm(formula, data = sample_data)
# Get predictions and MAE
lm_preds <- predict(linear_model, test_smp)
lm_mae <- mean(abs(lm_preds - test_smp$price))
lm_mae
[1] 4560.298
# Calculate correlation coefficient
lm_cr <- cor(lm_preds,test_smp$price)
lm_cr
[1] 0.9543588
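The model formula above is assembled by pasting column names into a string. Base R's `reformulate()` builds the same kind of formula directly from a character vector. The sketch below uses a small made-up data frame named `toy`, so the values and the reduced set of columns are assumptions for illustration.

```r
# Made-up data with a subset of the flight columns (hypothetical values)
toy <- data.frame(
  price     = c(5953, 5956, 25612, 42220, 6100, 51707),
  duration  = c(2.17, 2.33, 2.00, 15.83, 2.25, 24.75),
  days_left = c(1, 10, 1, 20, 30, 5),
  class     = factor(c("Economy", "Economy", "Business",
                       "Business", "Economy", "Business"))
)

# Every column except the response becomes a predictor
predictor_vars <- setdiff(names(toy), "price")
model_formula  <- reformulate(predictor_vars, response = "price")
print(model_formula)

# The formula can be passed to lm() like a hand-written one
toy_fit <- lm(model_formula, data = toy)
print(names(coef(toy_fit)))
```
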
Support Vector Machine
A support vector machine (SVM) is another model we use to predict
the price of the plane tickets. The independent variables are departure
time of the flight, duration of the flight, days left until the flight,
number of stops, source city, airline, class, and destination city.
Three different kernels (linear, radial, and polynomial) were tried to
find a suitable model. The radial SVM performed best in predicting
flight prices, with the smallest mean absolute error and the highest
correlation coefficient.
# linear svm model
s <- svm(price ~ departure_time + duration + days_left + stops + source_city + airline + class + destination_city,
sample_data,
kernel="linear")
s_preds <- predict(s, test_smp)
mae_s <- mean(abs(s_preds - test_smp$price))
mae_s
[1] 4183.375
cr_s <- cor(s_preds,test_smp$price)
cr_s
[1] 0.9514206
# radial svm model
r <- svm(price ~ departure_time + duration + days_left + stops + source_city + airline + class + destination_city,
sample_data,
kernel="radial")
r_preds <- predict(r, test_smp)
mae_r <- mean(abs(r_preds - test_smp$price))
mae_r
[1] 3139.727
cr_r <- cor(r_preds,test_smp$price)
cr_r
[1] 0.9731739
# poly svm model
poly <- svm(price ~ departure_time + duration + days_left + stops + source_city + airline + class + destination_city,
sample_data,
kernel="polynomial")
poly_preds <- predict(poly, test_smp)
mae_p <- mean(abs(poly_preds - test_smp$price))
mae_p
[1] 7739.832
cr_p <- cor(poly_preds,test_smp$price)
cr_p
[1] 0.91568
After tuning, the radial SVM became more accurate: the mean absolute
error decreased and the correlation coefficient increased. The best
values found for gamma and cost are 0.1 and 10, respectively. For
efficiency, the parameter search was run on a smaller random subset of
5,000 training entries; the final model was then refit on the
10,000-row sample using the selected parameters.
# Take a smaller sample
sample_data2 <- train_smp[sample(nrow(train_smp), 5000), ]
# Find the best parameters
p <- tune.svm(price ~ departure_time + duration + days_left + stops + source_city + airline + class + destination_city,
data = sample_data2,
gamma=c(0.01, 0.1, 1),
cost=c(1, 5, 10) ,
kernel = "radial")
# Train the model with the best parameters
new_r <- svm(price ~ departure_time + duration + days_left + stops + source_city + airline + class + destination_city,
sample_data,
kernel="radial",
gamma=p$best.parameters$gamma,
cost=p$best.parameters$cost)
new_r_preds <- predict(new_r, test_smp)
# Calculate Mean Absolute Error (MAE)
new_mae_r <- mean(abs(new_r_preds - test_smp$price))
new_mae_r
[1] 2851.704
# Calculate correlation coefficient
new_cr_r <- cor(new_r_preds,test_smp$price)
new_cr_r
[1] 0.9783083
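The tuning call searches every combination of the candidate gamma and cost values, which explains why it was run on a smaller sample. The sketch below only counts the work involved and does not fit any model. The figure of 10 folds is an assumption based on the default of `tune.control()` in e1071, which the call above does not override.

```r
# Candidate values taken from the tune.svm() call above
param_grid <- expand.grid(
  gamma = c(0.01, 0.1, 1),
  cost  = c(1, 5, 10)
)
print(param_grid)

# Assumed: tune.control() defaults to 10-fold cross-validation
n_folds <- 10
n_fits  <- nrow(param_grid) * n_folds

cat("Parameter combinations:", nrow(param_grid), "\n")
cat("SVM fits during tuning: ", n_fits, "\n")
```

Under that assumption, every extra candidate value for either parameter adds another 30 fits to the search.
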
Conclusion
For this project we set out to see which factors influence flight
prices. For our first question, we asked whether departure time is
related to the price of the tickets. Our visualizations show that it
is: late-night tickets are usually the cheapest, with a mean price of
9,295.30 Rupees and very little variability in price. The remaining
departure times differ much less from one another, with mean prices
between roughly 18,000 and 23,000 Rupees.
For our second question, we looked for a correlation between the
duration of the flight and the price of the ticket. The trend line in
the scatter plot shows a positive correlation between duration and
ticket price. After differentiating the ticket classes in the scatter
plot, we can also see that business-class tickets get more expensive
with duration than economy tickets do.
For our last question, we wanted to see if there was a relation
between the days left before departure and the price of the ticket. We
used scatter plots split by source and destination city, and every plot
shows a negative correlation between price and days left before
departure: the further away the departure date, the cheaper the
tickets.
For our first tree model, we used only 3 independent variables
(departure time, duration, days_left) with our target variable, price,
to train our model. This did not lead to good results, as we got a mean
absolute error of 18879.4 and a correlation coefficient of 0.2592. For
our second tree model, we added five more predictors (stops, source
city, airline, class, and destination city) and obtained better
results, with a mean absolute error of 4315.176 and a correlation
coefficient of 0.957.
For our linear model, we used all the variables in the data set
except the flight code and obtained a mean absolute error of 4560.298
and a correlation coefficient of 0.954.
For our SVMs, we first trained one linear SVM, one radial SVM, and
one polynomial SVM using the same eight predictors as the second tree
model. The table below shows the mean absolute error and the
correlation coefficient for each of these models.
| Model          | Mean absolute error | Correlation coefficient |
|----------------|--------------------:|------------------------:|
| Linear SVM     |            4183.375 |               0.9514206 |
| Radial SVM     |            3139.727 |               0.9731739 |
| Polynomial SVM |            7739.832 |               0.91568   |
We decided to tune the radial SVM to make it more accurate by
searching for the best gamma and cost values. After tuning the model,
the MAE decreased to 2851.704 and the correlation coefficient rose to
0.9783.
After evaluating all our models, we can see that Radial SVM has the
best performance overall even before the tuning with a MAE of 3139.727
and a correlation coefficient of 0.973.
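The metrics reported throughout this report can be collected into a single table and ranked by mean absolute error. The sketch below uses only the values printed above; the data frame name `model_results` and the model labels are chosen here for illustration.

```r
# Metrics copied from the outputs reported in this document
model_results <- data.frame(
  model = c("Tree (3 predictors)", "Tree (8 predictors)", "Linear model",
            "Linear SVM", "Radial SVM", "Polynomial SVM",
            "Radial SVM (tuned)"),
  mae = c(18879.4, 4315.176, 4560.298,
          4183.375, 3139.727, 7739.832, 2851.704),
  correlation = c(0.2592221, 0.9569965, 0.9543588,
                  0.9514206, 0.9731739, 0.91568, 0.9783083)
)

# Rank models from lowest to highest mean absolute error
ranked <- model_results[order(model_results$mae), ]
rownames(ranked) <- NULL
print(ranked)
```

The ranking is not a strict like-for-like comparison, because the code above fits the tree models on the full training set and the linear model and SVMs on the 10,000-row sample.
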
From this project, we can conclude that there are definitely factors
such as departure time, duration and the days left from departure
affecting the price of flight tickets.
Citations
[1] jan-glx, “Scatterplot with too many points,” Stack Overflow,
Oct. 10, 2011. https://stackoverflow.com/questions/7714677/scatterplot-with-too-many-points/58523956#58523956
[2] “How to Prune a Tree in R?,” GeeksforGeeks, Jun. 13, 2024. https://www.geeksforgeeks.org/how-to-prune-a-tree-in-r/
[3] “The World of Air Transport in 2019,” ICAO Annual Report 2019. https://www.icao.int/annual-report-2019/Pages/the-world-of-air-transport-in-2019.aspx
---
title: "Flight Price Analysis and Prediction"
author: "Thune Kyae Sin Su - B00868806, Yuki Law - B00865885"
group: "Hangry and Angry"
output: html_notebook
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(dplyr)
library(ggplot2)
library(knitr)
library(ggpointdensity)
library(DMwR2)
library(rpart.plot)
library(rpart)
library(e1071)
```

## Introduction

The dynamics of flight pricing interest and affect travelers around the globe. According to ICAO statistics, the total number of passengers carried on scheduled services rose to 4.5 billion in 2019, a 3.6 per cent increase from the previous year, while the number of departures reached 38.3 million, a 1.7 per cent increase [3]. Air travel is expected to keep growing, which is why it is important for travelers to understand the factors behind flight prices. Understanding these factors could save them money as they go off to see their families or take a well-deserved break. In this report, we tackle the questions below:

1.  Does departure time affect the price of the air ticket? Which time has the cheapest and most expensive flight ticket?

2.  Does the duration of the flight affect the price of the air ticket?

3.  Is the price of a flight affected by the number of days left before departure?

## Data

The data we used to investigate our questions comes from Clean_Dataset.csv in the Flight Price Prediction dataset on Kaggle, available at <https://www.kaggle.com/datasets/shubhambathwal/flight-price-prediction/data>. The dataset was prepared and compiled by Shubham Bathwal from the “Ease My Trip” website. It covers both economy-class and business-class tickets for flights between India's top 6 metro cities, collected over a period of 50 days.

The dataset contains 300,153 entries, each representing a flight ticket listed on the “Ease My Trip” website. The raw file has 12 columns: a row index plus 11 columns of flight information, namely airline, flight code, source city, departure time, number of stops, arrival time, destination city, ticket class, flight duration, days left until the flight, and ticket price. The dataset covers 6 unique airlines, 1,561 unique flight codes, 6 unique source cities, 6 unique departure time slots, 6 unique arrival time slots, 6 unique destination cities, and 2 ticket classes.

## Data Cleaning and Pre-processing

The first column of the dataset holds the row number of each entry. As this information is not needed, the column is removed. Several columns (airline, flight, source city, departure time, stops, arrival time, destination city, and class) are of character type. These columns are converted to factors for easier manipulation and analysis. The cleaned dataset contains 11 columns: 8 of type factor and 3 of type double.

```{r}
# Loading the data
data <- read_csv("archive/Clean_Dataset.csv")

# Remove the row-number column and convert character columns to factors
df <- data %>% select(-1) %>%
  mutate(across(c(airline, flight, source_city, departure_time, stops, arrival_time, destination_city, class), as.factor))

# Structure of dataset
str(df)

# Print the dataset
df
```
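
The factor conversion above names each character column explicitly. As a minimal sketch, the same result can be reached in base R by detecting character columns automatically. The `toy` data frame below is a made-up stand-in for the flight data, so its values are assumptions for illustration.

```r
# Toy stand-in for the flight data (hypothetical values)
toy <- data.frame(
  airline  = c("Vistara", "Indigo", "Vistara"),
  class    = c("Economy", "Business", "Economy"),
  duration = c(2.17, 2.33, 2.25),
  price    = c(5953, 25612, 5955),
  stringsAsFactors = FALSE
)

# Find the character columns and convert only those to factors
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

str(toy)
```

Detecting the columns by type avoids having to update the list by hand if the input file gains or loses a text column.
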

## Exploratory Data Analysis

```{r}
# Summary of the dataset
summary(df)
```

## Visualizations & Interpretation

Question 1:

The boxplot shows that the median price is relatively similar across most departure times, ranging between 6,500 and 8,200 Rupees. Late-night departures have a lower median than the other departure times, at 4,499 Rupees. However, it is difficult to determine which departure time has the highest median based solely on the boxplot. The boxplot also shows that late-night departures have the smallest interquartile range in price, while night departures have the largest, indicating greater price variability.

```{r}
# Boxplot for flight prices by departure time
ggplot(df, aes(x = departure_time, y = price)) +
  geom_boxplot() +
  labs(title = "Flight Prices by Departure Time", x = "Departure Time", y = "Price") +
  theme_minimal()
```
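
The medians and interquartile ranges that are hard to read off the boxplot can be computed directly. The sketch below does this in base R on a small made-up data frame; the ten prices are assumptions for illustration and are not taken from the dataset.

```r
# Made-up prices for two departure slots (hypothetical values)
toy <- data.frame(
  departure_time = rep(c("Late_Night", "Night"), each = 5),
  price = c(3000, 4000, 4500, 5000, 6000,
            4000, 6000, 8000, 40000, 60000)
)

# Split prices by departure slot, then summarise each group
price_by_time <- split(toy$price, toy$departure_time)
spread_summary <- data.frame(
  departure_time = names(price_by_time),
  median_price   = vapply(price_by_time, median, numeric(1)),
  iqr_price      = vapply(price_by_time, IQR, numeric(1)),
  row.names      = NULL
)
print(spread_summary)
```

Run on `df`, the same summary would show which departure time has the highest median without relying on the plot.
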

Comparing mean prices gives the same picture: late-night departures remain the cheapest, with a mean ticket price of 9,295.30 Rupees, while night departures are the most expensive, with a mean of 23,062.15 Rupees.

```{r}
# Calculate the mean price by departure time
df_departure <- df %>%
  group_by(departure_time) %>%
  summarize(mean_price = mean(price))

# Display the mean flight ticket price in a table
kable(df_departure, caption = "Mean Price by Departure Time")
```

The histogram of night departure prices is right-skewed: most tickets fall at the lower end of the price range, while a long tail extends to higher prices. The histogram also has a second, smaller peak around 30,000 to 70,000 Rupees, so tickets with night departure times typically fall within the price range of one of the two peaks.

```{r}
# Get the entries for flights with Night departure times
df_night <- df %>% filter(departure_time == "Night")

# Plot histogram to investigate the distribution of the Night departure times flight price
ggplot(df_night, aes(x = price)) +
  geom_histogram(bins = 30, fill = "lightblue", color = "black") +
  labs(title = "Histogram of Night Flight Prices", x = "Price", y = "Frequency") +
  theme_minimal()
```
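
One way to check the right skew described above without relying on the histogram is to compare the mean with the median and to compute a moment-based skewness coefficient. The sketch below uses seven made-up prices; both the values and the helper name `sample_skewness` are assumptions for illustration.

```r
# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 1.5
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Made-up prices: mostly cheap tickets plus a few expensive ones
toy_prices <- c(4000, 5000, 6000, 7000, 8000, 45000, 60000)

skew_check <- c(
  mean     = mean(toy_prices),
  median   = median(toy_prices),
  skewness = sample_skewness(toy_prices)
)
print(round(skew_check, 2))
```

A mean well above the median together with a positive skewness value indicates a right-skewed distribution. The same check on `df_night$price` would also help explain why night departures rank highest by mean even though the medians are close.
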

The histogram of late-night ticket prices shows that the data are heavily concentrated at the lower end. There are only a few isolated outliers where prices exceed 20,000 Rupees. The histogram also shows that there are far fewer late-night tickets than night tickets.

```{r}
# Get the entries for flights with Late Night departure times
df_late <- df %>% filter(departure_time == "Late_Night")

# Plot histogram to investigate the distribution of the Late Night departure times flight price
ggplot(df_late, aes(x = price)) +
  geom_histogram(bins =30, fill = "lightblue", color = "black") +
  labs(title = "Histogram of Late Night Flight Prices", x = "Price", y = "Frequency") +
  theme_minimal()
```

Question 2:

The scatterplot reveals that there are two different clusters present in the graph. Many of the points are concentrated on the lower left of the graph, which indicates that many of the flight tickets are lower in price and the flights have shorter duration. The positively sloped trendline also demonstrates that the duration of the flights and the price of flight tickets are positively correlated, where the longer the flight duration, the higher the price of the flight tickets.

```{r}
# Plot scatter and density plot to investigate the relationship between duration and the prices of the flight 
ggplot(df, aes(x=duration, y=price)) + 
  geom_pointdensity(size = 0.5, alpha=0.05) + 
  scale_color_viridis_c() +
  geom_smooth(method = "lm", formula = y ~ x, color = "red") +
  labs(title = "Duration vs Price")
```

To further examine the two clusters in the scatterplot, we coloured the points by ticket class. With the colours applied, the scatterplot shows that the two clusters correspond to the two ticket classes: the pink cluster contains business-class tickets and the blue cluster contains economy-class tickets. Business-class tickets are generally more expensive than economy-class tickets, so the business-class cluster sits higher in the graph. The two ticket classes also explain the wide range of prices for the same flight duration.

```{r}
# Scatterplot to show the class of the flight tickets
ggplot(df, aes(x = duration, y = price, color = class)) +
  geom_point(size = 0.2, alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ x, color = "red") +
  labs(title = "Duration vs Price", x = "Duration", y = "Price", color = "Class") +
  theme_minimal()
```

The scatter-density plot for business-class flight tickets reveals two main clusters of points. These clusters are centered around 25,000 and 60,000 Rupees. Both clusters fall within the duration range of 0 to 20 hours. The trend line in the graph highlights a positive relationship between flight duration and ticket price, where longer flights generally have higher prices for business-class tickets.

```{r}
# Get the entries for business-class flight tickets
df_business <- df %>% filter(class == "Business")

# Plot scatter-density plot to investigate the relationship between duration and the prices of Business class flights
ggplot(df_business, aes(x = duration, y = price)) + 
  geom_pointdensity(size = 0.5, alpha = 0.5) + 
  scale_color_viridis_c() +
  geom_smooth(method = "lm", formula = y ~ x, color = "red") +
  labs(title = "Business Class's Duration vs Price", x = "Duration", y = "Price") +
  theme_minimal()
```

As with the business-class tickets, the scatter-density plot for economy-class flight tickets has two main clusters. One cluster is located around the price level of 2500 Rupees, and the second cluster is around 50,000 Rupees. Most flights have durations shorter than 20 hours, as indicated by the higher density of points within this range. The trend line, as in the previous graph, shows a positive relationship between flight duration and ticket price.

```{r}
# Get the entries for economy-class flight tickets
df_economy <- df %>% filter(class == "Economy")

# Plot scatter-density plot to investigate the relationship between duration and the prices of Economy class flights
ggplot(df_economy, aes(x = duration, y = price)) + 
  geom_pointdensity(size = 0.5, alpha = 0.5) + 
  scale_color_viridis_c() +
  geom_smooth(method = "lm", formula = y ~ x, color = "red") +
  labs(title = "Economy Class's Duration vs Price", x = "Duration", y = "Price") +
  theme_minimal()
```
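Because the two ticket classes sit at very different price levels, it can help to measure the duration effect within each class separately. The following is a minimal sketch on synthetic data (the `toy` data frame and its price levels are assumptions for illustration, not values from our dataset): it computes the correlation per class and fits one linear model with a separate intercept and slope for each class.

```r
# Synthetic two-class data (made-up price levels, for illustration only)
set.seed(2)
n <- 150
toy <- data.frame(
  class = rep(c("Economy", "Business"), each = n),
  duration = runif(2 * n, min = 1, max = 30)
)
base_price <- ifelse(toy$class == "Business", 45000, 5000)
toy$price <- base_price + 200 * toy$duration + rnorm(2 * n, sd = 2000)

# Correlation between duration and price within each class
cor_by_class <- sapply(split(toy, toy$class),
                       function(d) cor(d$duration, d$price))
cor_by_class

# One model with a class-specific intercept and slope (interaction term)
class_fit <- lm(price ~ duration * class, data = toy)
coef(class_fit)
```

The `duration:classEconomy` coefficient estimates how much the slope for economy tickets differs from the slope for business tickets, which is one way to check whether prices rise faster with duration in one class than in the other.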

Question 3:

Below, we created a scatter plot to see if there is a correlation between the days left before the departure date and the price of the flight ticket. Furthermore, we split the plot by source and destination city, since we did not have individual flight numbers to track.

```{r}
library(ggplot2)
library(tidyr)  # provides unite()

# Adding a new column "source_dest" which is a combination of 
# source and destination city
combined_cities <- df %>% unite(source_dest, source_city, destination_city, sep = "_", remove = FALSE)

# Define the plotting function
plot_for_category <- function(cat, data) {
  subset_data <- subset(data, source_dest == cat)
  
  ggplot(subset_data, aes(x = days_left, y = price)) +
    geom_point(color = "blue") +
    geom_smooth(method = "lm", color = "red") +
    labs(
      title = paste("Plot for Category:", cat),
      x = "Days before Departure",
      y = "Price (Rupees)"
    ) +
    theme_minimal()
}

# Get unique categories
unique_categories <- unique(combined_cities$source_dest)

# Generate and store plots for each category
plots <- lapply(unique_categories, function(cat) plot_for_category(cat, combined_cities))

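# Display all plots, limiting the y-axis to 0-100,000 Rupees
# (the limit has to be added to the plot that is actually printed)
for (plt in plots) {
  print(plt + ylim(0, 100000))
}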
```
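Reading the direction of the trend from one plot per route is easy to get wrong when there are many routes, so a per-route table can complement the plots. The sketch below is an illustration on synthetic data (the `toy` data frame, its four routes, and its prices are made up): it computes the correlation and the fitted slope between `days_left` and `price` for each `source_dest` value.

```r
# Synthetic routes and prices (for illustration only)
set.seed(3)
toy <- expand.grid(
  source_city = c("Delhi", "Mumbai"),
  destination_city = c("Bangalore", "Kolkata"),
  days_left = 1:49,
  stringsAsFactors = FALSE
)
toy$price <- 12000 - 100 * toy$days_left + rnorm(nrow(toy), sd = 1500)
toy$source_dest <- paste(toy$source_city, toy$destination_city, sep = "_")

# One row per route: correlation and slope of price against days_left
route_summary <- do.call(rbind, lapply(split(toy, toy$source_dest), function(d) {
  data.frame(
    source_dest = d$source_dest[1],
    correlation = cor(d$days_left, d$price),
    slope = unname(coef(lm(price ~ days_left, data = d))["days_left"])
  )
}))
route_summary
```

A negative slope for a route means that tickets on that route tend to be cheaper the further away the departure date is.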

## Machine Learning Models

First, we will be splitting the dataset into the training and testing set to build our machine learning models. The training set will have 70 percent of the dataset and the testing set will have 30 percent of the dataset. We also chose to create a sample training data of 10000 rows, as it was taking our computers too long to compute the models using the larger dataset.

```{r}
set.seed(123)
smp_size <- floor(0.7 * nrow(df))
row_index <- sample(1: nrow(df), size = smp_size)
train_smp <- df[row_index, ]
test_smp <- df[-row_index, ]
sample_data <- train_smp[sample(nrow(train_smp), 10000), ]
```
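A quick sanity check on a split like this is to confirm that the two sets have the expected sizes and share no rows. The sketch below repeats the same splitting steps on a small synthetic data frame (`toy`, with a made-up `id` column) so that it runs on its own.

```r
# Synthetic data with an explicit row identifier (for illustration only)
set.seed(123)
toy <- data.frame(id = 1:1000, price = runif(1000, 1000, 50000))

# Same 70/30 split logic as above
toy_size <- floor(0.7 * nrow(toy))
toy_index <- sample(seq_len(nrow(toy)), size = toy_size)
toy_train <- toy[toy_index, ]
toy_test <- toy[-toy_index, ]

# The two sets should partition the data: expected sizes, no shared rows
nrow(toy_train)
nrow(toy_test)
length(intersect(toy_train$id, toy_test$id))
```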

### Tree-Based Models

The first model we use to predict flight prices is a tree-based model. In this model, we used one categorical variable, the departure time of the flight, and two continuous variables, the duration of the flight and the days left until the day of the flight. However, this model did not predict the prices well, as the mean absolute error is high and the correlation between predicted and actual prices is weak.

```{r}
# Tree-Based Models
ar <- rpart(price ~ departure_time + duration + days_left, 
            train_smp)
preds <- predict(ar, test_smp)

# Calculate Mean Absolute Error (MAE)
mae <- mean(abs(preds - test_smp$price))
mae

# Calculate correlation coefficient
cr <- cor(preds,test_smp$price)
cr

# Get the optimal cp value
optimal_cp <- ar$cptable[which.min(ar$cptable[,"xerror"]), "CP"]

# Prune the tree
pruned_tree <- prune(ar, cp = optimal_cp)

# Visualize the tree
rpart.plot(pruned_tree, box.palette = "Blues", main = "Simplified Decision Tree")
```
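The same two metrics are computed after every model in this section, so they could be wrapped in a small helper. The sketch below is one possible way to do this; `eval_metrics` and `baseline_mae` are hypothetical names that do not appear elsewhere in our code. The baseline, which always predicts the mean training price, gives a reference point for judging whether a mean absolute error is "high".

```r
# Hypothetical helper: mean absolute error and correlation for a set of predictions
eval_metrics <- function(preds, actual) {
  c(mae = mean(abs(preds - actual)),
    cor = cor(preds, actual))
}

# Hypothetical baseline: MAE obtained by always predicting the training mean
# (the correlation is undefined for a constant prediction, so only MAE is returned)
baseline_mae <- function(train_y, test_y) {
  mean(abs(mean(train_y) - test_y))
}

# Tiny made-up example
actual <- c(100, 200, 300, 400)
preds <- c(110, 190, 320, 380)
metrics <- eval_metrics(preds, actual)
metrics
baseline <- baseline_mae(train_y = c(150, 250, 350), test_y = actual)
baseline
```

With our data, the calls would look like `eval_metrics(preds, test_smp$price)` and `baseline_mae(train_smp$price, test_smp$price)`. A model whose MAE is close to the baseline has learned little beyond the average price.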

To make the model more accurate, we trained the tree-based model with more independent variables. After retraining, the model produces predictions with a very strong correlation to the actual prices and a lower mean absolute error compared to the previous model.

```{r}
# Train model with more variables
ar2 <- rpart(price ~ departure_time + duration + days_left + stops + source_city + airline + class + destination_city,
            train_smp)
preds <- predict(ar2, test_smp)

# Calculate Mean Absolute Error (MAE)
mae <- mean(abs(preds - test_smp$price))
mae

# Calculate correlation coefficient
cr <- cor(preds,test_smp$price)
cr
```

The plot shows that the flight class is the main predictor of the flight ticket's price. The duration of the flight further influences the price of the business class flight.

```{r}
# Get the optimal cp value
optimal_cp2 <- ar2$cptable[which.min(ar2$cptable[,"xerror"]), "CP"]

# Prune the tree
pruned_tree2 <- prune(ar2, cp = optimal_cp2)

# Visualize the tree
rpart.plot(pruned_tree2, box.palette = "Blues", main = "Simplified Decision Tree")
```

### Linear Models

The next model we will be using to predict flight prices will be a linear model.

```{r}
# Remove Flight column from data frame
sample_data <- subset(sample_data, select = -c(flight))

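# Build the formula for the linear model: price against all other columns
# (these are the predictors, i.e. the independent variables)
predictor_vars <- setdiff(names(sample_data), "price")
formula <- as.formula(paste("price ~", paste(predictor_vars, collapse = " + ")))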

# Build the linear model
linear_model <- lm(formula, data = sample_data)
```
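Base R also offers `reformulate()` for building a formula from a character vector of predictor names, which avoids pasting strings together by hand. The sketch below shows the idea on a tiny made-up data frame (`toy`); it is an alternative way to write the step above, not the code we ran.

```r
# Tiny made-up data frame (for illustration only)
toy <- data.frame(
  price = c(5000, 6200, 7100, 9000, 9900),
  duration = c(2, 3, 4, 6, 7),
  days_left = c(40, 30, 22, 10, 3),
  flight = c("A1", "B2", "C3", "D4", "E5")
)

# Every column except the response and the flight code is a predictor
toy_predictors <- setdiff(names(toy), c("price", "flight"))

# reformulate() builds price ~ duration + days_left from the names
toy_formula <- reformulate(toy_predictors, response = "price")
toy_formula

toy_lm <- lm(toy_formula, data = toy)
coef(toy_lm)
```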

```{r}
# Get predictions and MAE
lm_preds <- predict(linear_model, test_smp)
lm_mae <- mean(abs(lm_preds - test_smp$price))
lm_mae

# Calculate correlation coefficient
lm_cr <- cor(lm_preds,test_smp$price)
lm_cr
```

### Support Vector Machine

The support vector machine (SVM) is another model we use to predict the price of the plane tickets. The independent variables are the departure time of the flight, the duration of the flight, the days left until the flight, the number of stops, the source city, the airline, the class, and the destination city. We tried three different kernels to find a suitable model. The radial SVM performed best in predicting the flight prices, as it has the smallest mean absolute error and the highest correlation coefficient.

```{r}
# linear svm model
s <- svm(price ~ departure_time + duration + days_left + stops + source_city + airline + class + destination_city, 
         sample_data, 
         kernel="linear")
s_preds <- predict(s, test_smp)
mae_s <- mean(abs(s_preds - test_smp$price))
mae_s
cr_s <- cor(s_preds,test_smp$price)
cr_s

# radial svm model
r <- svm(price ~ departure_time + duration + days_left + stops + source_city + airline + class + destination_city, 
         sample_data, 
         kernel="radial")
r_preds <- predict(r, test_smp)
mae_r <- mean(abs(r_preds - test_smp$price))
mae_r
cr_r <- cor(r_preds,test_smp$price)
cr_r

# poly svm model
poly <- svm(price ~ departure_time + duration + days_left + stops + source_city + airline + class + destination_city, 
            sample_data, 
            kernel="polynomial")
poly_preds <- predict(poly, test_smp)
mae_p <- mean(abs(poly_preds - test_smp$price))
mae_p
cr_p <- cor(poly_preds,test_smp$price)
cr_p
```

After tuning the radial SVM model, the model became more accurate: the mean absolute error decreased and the correlation coefficient increased. The best parameters for gamma and cost are 0.1 and 10, respectively. For efficiency, the parameter search was run on a smaller random subset of 5000 entries, and the final model was then trained with the best parameters on the 10000-row sample.

```{r}
# Take a smaller sample
sample_data2 <- train_smp[sample(nrow(train_smp), 5000), ]

# Find the best parameters
p <- tune.svm(price ~ departure_time + duration + days_left + stops + source_city + airline + class + destination_city, 
              data = sample_data2, 
              gamma=c(0.01, 0.1, 1), 
              cost=c(1, 5, 10) , 
              kernel = "radial") 

# Train the model with the best parameters
new_r <- svm(price ~ departure_time + duration + days_left + stops + source_city + airline + class + destination_city, 
             sample_data, 
             kernel="radial", 
             gamma=p$best.parameters$gamma, 
             cost=p$best.parameters$cost)
new_r_preds <- predict(new_r, test_smp)

# Calculate Mean Absolute Error (MAE)
new_mae_r <- mean(abs(new_r_preds - test_smp$price))
new_mae_r

# Calculate correlation coefficient
new_cr_r <- cor(new_r_preds,test_smp$price)
new_cr_r
```

## Conclusion

For this project, we set out to see whether there are factors that influence flight prices. For our first question, we asked whether there is a relationship between the time of departure and the price of the tickets. Our visualizations show that there is one: late-night tickets are usually the cheapest, with a mean price of 9295.3 Rupees and very little variability in price. The remaining departure times differ little from one another, as their mean prices all center around 21000 Rupees.

For our second question, we examined whether there is a correlation between the duration of the flight and the price of the ticket. From the trend line in the scatter plot alone, we can see a positive correlation between the duration and the price of the ticket. On further inspection, after differentiating the classes of the tickets, we can see that business-class tickets get even more expensive with duration than economy tickets.

For our last question, we wanted to see whether there is a relationship between the days left before departure and the price of the ticket. We used scatter plots split by source and destination city, and every plot shows a negative correlation between the price and the days left before departure: the further away the departure date is, the cheaper the tickets.

For our first tree model, we used only three independent variables (departure time, duration, and days left) to predict our target variable, price. This did not give good results, as we got a mean absolute error of 18879.4 and a correlation coefficient of 0.2592. For our second tree model, we added the number of stops, source city, airline, class, and destination city as predictors and received better results, with a mean absolute error of 4315.176 and a correlation coefficient of 0.957.

For our linear model, we used all the variables in the data set except the flight code to train the model and obtained a mean absolute error of 4560.298 and a correlation coefficient of 0.954.

For our SVMs, we first trained one linear SVM, one radial SVM, and one polynomial SVM using the same eight predictors as the second tree model. The table below shows the mean absolute error and the correlation coefficient for each of these models.

| Model          | Mean absolute error | Correlation coefficient |
|----------------|---------------------|-------------------------|
| Linear SVM     | 4183.375            | 0.9514206               |
| Radial SVM     | 3139.727            | 0.9731739               |
| Polynomial SVM | 7739.832            | 0.91568                 |

We decided to tune the Radial SVM to make it more accurate by searching for the best gamma and cost values. After tuning the model, the MAE decreased to 2851.704 and the correlation coefficient rose to 0.9783.

After evaluating all our models, we can see that Radial SVM has the best performance overall even before the tuning with a MAE of 3139.727 and a correlation coefficient of 0.973.
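To compare all models side by side, the figures reported in this section can be collected in one data frame and sorted by mean absolute error. The sketch below only restates the numbers given above; the `results` data frame and the model labels are introduced here for illustration.

```r
# Metrics as reported in this section (no new results)
results <- data.frame(
  model = c("Tree (3 predictors)", "Tree (8 predictors)", "Linear model",
            "Linear SVM", "Radial SVM", "Polynomial SVM", "Tuned radial SVM"),
  mae = c(18879.4, 4315.176, 4560.298, 4183.375, 3139.727, 7739.832, 2851.704),
  correlation = c(0.2592, 0.957, 0.954, 0.9514206, 0.9731739, 0.91568, 0.9783)
)

# Sort from lowest to highest mean absolute error
ranked <- results[order(results$mae), ]
ranked
```

Sorted this way, the two radial SVMs come first, followed by the linear SVM, the second tree model, and the linear model.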

From this project, we can conclude that there are definitely factors such as departure time, duration and the days left from departure affecting the price of flight tickets.

## Citations

[1] jan-glx, “Scatterplot with too many points,” Stack Overflow, Oct. 10, 2011. <https://stackoverflow.com/questions/7714677/scatterplot-with-too-many-points/58523956#58523956>

[2] “How to Prune a Tree in R?,” GeeksforGeeks, Jun. 13, 2024. <https://www.geeksforgeeks.org/how-to-prune-a-tree-in-r/>

[3] ICAO, “The World of Air Transport in 2019,” Annual Report 2019. <https://www.icao.int/annual-report-2019/Pages/the-world-of-air-transport-in-2019.aspx>
